Discussion on machine learning technology to predict tacrolimus blood concentration in patients with nephrotic syndrome and membranous nephropathy in real-world settings

Background: Given the narrow treatment window, high toxicity, adverse effects, and individual differences in the use of tacrolimus (TAC), we collected and sorted data on TAC use by real patients with kidney diseases. We then used machine learning technology to predict TAC blood concentration in order to provide a basis for TAC dose adjustment and ensure patient safety.
Methods: This study involved 913 hospitalized patients with nephrotic syndrome and membranous nephropathy treated with TAC. We evaluated data related to patient demographics, laboratory tests, and combined medication. After data cleaning and feature engineering, six machine learning models were constructed, and the predictive performance of each model was evaluated via external verification.
Results: The XGBoost model outperformed the other investigated models, with a prediction accuracy of 73.33%, an F-beta score of 91.24%, and an AUC of 0.5531.
Conclusions: This exploratory study assessed the ability of machine learning to predict TAC blood concentration. Although the results support the predictive potential of machine learning to some extent, in-depth research is still needed to resolve the XGBoost model's bias towards the positive class and thereby facilitate its use in real-world settings.


Background
Tacrolimus (TAC, FK506) is a new immunosuppressant that functions by inhibiting the activity of calcineurin and interfering with T cell activation and cytokine transcription after binding to intracellular FK binding protein. Recent studies have shown that TAC is effective in the treatment of a variety of chronic kidney diseases [1,2]. However, its narrow treatment window, high toxicity, adverse effects, and individual differences in pharmacokinetics and pharmacodynamics have hindered its application in clinical treatment. Therefore, in clinical use, monitoring the blood concentration, adjusting the treatment plan, and administering individualized dosages of TAC are necessary to achieve the best treatment effect [3]. Real-world medical data are widely stored in hospital information systems, which include comprehensive diagnostic and treatment information. The optimization, upgrading, and popularization of hospital information systems not only provide a basis for the medical treatment of patients but also supply real-world data for retrospective research. Machine learning (ML) is a set of computer algorithms driven by data [4]. Its algorithms include the following: artificial neural network, decision tree, random forest, and support vector machine. ML is suitable for analyzing and mining real-world data in enormous quantities, high dimensions, complex relationships, and diverse forms. The rapid speed and strong generalizability of ML support its wide use in clinical decision-making. The application of ML algorithms to individualized medicine will aid in the understanding of precision medicine in clinical practice [5,6]. The purpose of this study was to explore the influencing factors of TAC blood concentration in real-world settings using ML technology to predict TAC blood concentration and assist clinicians in adjusting TAC dosage, ensuring patient safety, and reducing adverse drug reactions.

Study population
The data of patients with nephrotic syndrome and/or membranous nephropathy treated with TAC in PLA General Hospital from January 1, 2013, to December 31, 2020, were collected retrospectively. The inclusion criteria were as follows: (1) diagnosis of nephrotic syndrome or membranous nephropathy; and (2) administration of TAC during hospitalization. The exclusion criteria were as follows: (1)  The data mining and modeling processes are shown in Fig. 1. Following the cleaning step, the final data set comprised 913 patients and the blood TAC concentrations from 1829 blood tests. Data from January 1, 2013, to December 31, 2019, including 821 patients and 1,649 blood tests, were randomly divided into a training set and a test set at an 8:2 ratio. The data from January 1, 2020, to December 31, 2020, including 115 patients and 180 blood tests, were used as the external validation set (Fig. 2).
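The partitioning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record fields (`test_id`, `test_date`), the function name, and the choice to split at the level of individual blood tests are all assumptions.

```python
import random
from datetime import date

def split_by_period(records, cutoff=date(2020, 1, 1), train_frac=0.8, seed=0):
    """Hold out records dated on/after `cutoff` as the external validation set,
    then split the earlier records at random into training and test sets."""
    development = [r for r in records if r["test_date"] < cutoff]
    external = [r for r in records if r["test_date"] >= cutoff]
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(development)
    n_train = round(train_frac * len(development))
    return development[:n_train], development[n_train:], external

# Hypothetical records: five blood tests per year from 2013 to 2020.
records = [{"test_id": i, "test_date": date(2013 + i % 8, 6, 1)} for i in range(40)]
train, test, external = split_by_period(records)
```

Splitting by calendar period keeps the external validation set strictly later in time than the data used to fit the models.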

Data extraction
The relevant patient information was extracted from the database, including demographic, laboratory, and medical order information. Demographic information included data on age, sex, height, and weight. The laboratory information included the blood TAC concentrations, serum creatinine levels, sample receiving times, and result indicators. The medical order information included the name of the medication, dose, frequency of administration, and start and end times of the treatment. Because the medical order consisted of long-term information, it was split by frequency and processed into time-series data. To facilitate data processing, we stored patient hospitalization information in a tree structure rather than a two-dimensional table to build the data set (Fig. 3).
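One way to picture the tree structure and the splitting of long-term orders into time-series data is sketched here. The class names, field names, and the evenly spaced dosing schedule are assumptions made for illustration; the source does not describe its schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Order:
    drug: str
    dose_mg: float
    times_per_day: int
    start: datetime
    end: datetime

    def administrations(self):
        """Expand the long-term order into dosing events, assuming doses
        are evenly spaced over each 24 h (an illustrative assumption)."""
        step = timedelta(hours=24 / self.times_per_day)
        events, t = [], self.start
        while t < self.end:
            events.append((t, self.drug, self.dose_mg))
            t += step
        return events

@dataclass
class Hospitalization:
    admission_id: str
    orders: list = field(default_factory=list)       # Order objects
    lab_results: list = field(default_factory=list)  # (time, item, value) tuples

@dataclass
class Patient:
    patient_id: str
    age: int
    sex: str
    height_cm: float
    weight_kg: float
    hospitalizations: list = field(default_factory=list)

# Hypothetical patient: one admission with a twice-daily TAC order over two days.
order = Order("TAC", 1.0, 2, datetime(2019, 3, 1, 8), datetime(2019, 3, 3, 8))
stay = Hospitalization("A001", orders=[order])
patient = Patient("P001", 45, "M", 170.0, 68.0, hospitalizations=[stay])
events = order.administrations()
```

Nesting orders and laboratory results under each hospitalization keeps a variable number of records per patient together, which a flat two-dimensional table handles less naturally.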

Data processing
First, the distribution of the demographic data was plotted, and samples with outliers were deleted. Second, the medication and laboratory information were linked by time. When TAC was administered multiple times before a blood sample was collected, we selected the last TAC administration before sample collection to ensure that each test result matched the corresponding administration. Additionally, a box plot was drawn for the time interval between the last administration and the sample-receiving time, and only the samples between quartile 1 (Q1) and quartile 3 (Q3) were retained to eliminate samples whose medication information was not related to the laboratory information. Seven TAC doses were administered; in descending order of frequency of use, they were 2.0, 1.0, 1.5, 3.0, 0.5, 2.5, and 4.0 mg.
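The time-matching and interquartile filtering steps can be sketched as below. Times are expressed as plain hours for brevity, the function names are hypothetical, and the "inclusive" quartile convention is an assumption, since the source does not state how Q1 and Q3 were computed.

```python
from bisect import bisect_left
from statistics import quantiles

def last_dose_before(dose_times, sample_time):
    """Return the latest dosing time strictly before sample_time, or None.
    dose_times must be sorted in ascending order."""
    i = bisect_left(dose_times, sample_time)
    return dose_times[i - 1] if i > 0 else None

def keep_interquartile(intervals):
    """Return the indices of intervals lying between Q1 and Q3 (inclusive)."""
    q1, _, q3 = quantiles(intervals, n=4, method="inclusive")
    return [i for i, v in enumerate(intervals) if q1 <= v <= q3]

# Hypothetical dosing times (hours since admission) and one sample at hour 30.
matched = last_dose_before([0, 12, 24, 36], 30)

# Hypothetical dose-to-sample intervals in hours; the 2 h and 40 h values
# fall outside the interquartile range and are dropped.
intervals = [2, 6, 10, 11, 12, 13, 14, 40]
kept = keep_interquartile(intervals)
```

Keeping only the interquartile range is a strict filter: by construction it discards roughly half of the matched samples.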
In terms of combined medication, we extracted information on some of the medications most commonly prescribed by clinicians, including compound α-ketoacid, Shenyankangfu, ShenYanShu, Shenshuaining, Huangkui, Bailing, pidotimod, methylprednisolone, prednisone acetate, mycophenolate mofetil, and tripterygium glycoside. The combined medication variables were dummy variables: for each drug, patients who had used it between two blood tests were recorded as 1, and those who had not were recorded as 0.
Although blood concentration is a continuous variable, it was treated as a binary variable in this study and classified according to the safe range of blood drug concentration [7,8]. Concentrations within the safe range were coded as 0, and those outside the safe range were coded as 1.
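The two binary encodings described above can be sketched as follows. The function names, the day-based time window, and especially the range bounds (5.0 and 10.0) are placeholders chosen for illustration; they are not the safe range used in the study, which is defined in the cited references.

```python
def comedication_flags(drug_events, window_start, window_end, drugs):
    """Return {drug: 1 or 0} for whether each drug was used in
    [window_start, window_end), i.e. between two blood tests."""
    used = {name for day, name in drug_events if window_start <= day < window_end}
    return {drug: int(drug in used) for drug in drugs}

def concentration_label(concentration, lower, upper):
    """Return 0 if the concentration is inside [lower, upper], else 1.
    Treating the bounds as inclusive is an assumption."""
    return 0 if lower <= concentration <= upper else 1

drugs = ["pidotimod", "Bailing", "Huangkui"]
events = [(1, "Bailing"), (3, "pidotimod"), (9, "Huangkui")]  # (day, drug), hypothetical
flags = comedication_flags(events, window_start=0, window_end=7, drugs=drugs)

# Placeholder bounds, NOT the study's safe range.
labels = [concentration_label(c, lower=5.0, upper=10.0) for c in (3.2, 7.5, 12.1)]
```

Note that the positive class (1) covers concentrations both below and above the range, so a positive prediction alone does not say in which direction a dose should be adjusted.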
In this study, the ratio of class 0 to class 1 blood TAC concentrations was unbalanced at 3:7. Therefore, we used the oversampling method SMOTE (Synthetic Minority Oversampling Technique) to balance the data. The core of SMOTE is to generate new synthetic samples by random interpolation between a minority-class sample and its nearest minority-class neighbors, thereby increasing the number of minority-class samples and improving the unbalanced distribution of the data set [9]. As the XGBoost and LightGBM (LGBM) algorithms have hyperparameters for handling unbalanced data, we adjusted these hyperparameters directly, without additional SMOTE processing of the data, for these algorithms.
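The interpolation idea behind SMOTE can be illustrated with a small standard-library sketch. This is a simplified teaching version, not the implementation used in the study; the function name, parameters, and toy data are assumptions.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples, each lying on the line segment between
    a randomly chosen minority sample and one of its k nearest minority-class
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sqdist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new=10, k=2)
```

For the gradient boosting libraries, class weighting is typically exposed through parameters such as `scale_pos_weight` (XGBoost and LightGBM) or `is_unbalance` (LightGBM); the source does not state which hyperparameters were adjusted.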

Feature selection
The extracted variables included demographic information (age, sex, height, and weight), laboratory information (numerical results and collection time of blood TAC concentration and serum creatinine levels), medical order information (drug name, medication time, and dose), and medication combinations. Different tools were used to calculate the importance value of each feature, depending on the model. For example, logistic regression (LR), random forest, and AdaBoost (adaptive boosting) used the eli5 library with scikit-learn to visually display the importance of each feature, whereas XGBoost and LGBM used their own built-in algorithms. We removed the features with relatively low importance to reduce the feature dimension, simplify the model, and improve its generalization ability.
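Importance-based feature screening reduces to ranking features and keeping the top k, as in this minimal sketch. The importance values are made-up placeholders chosen only to make the example run; they are not results from the study.

```python
def top_k_features(importances, k):
    """Return the names of the k features with the highest importance,
    in descending order of importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Placeholder importance values (illustrative only, not the study's results).
importances = {
    "sex": 0.02,
    "weight": 0.15,
    "serum_creatinine": 0.30,
    "height": 0.13,
    "age": 0.14,
    "tac_dose": 0.10,
}
selected = top_k_features(importances, k=4)
```

In practice k is itself a tuning choice and would be selected by comparing model performance across different feature counts.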

Model building
Classification algorithms in supervised learning include LR, artificial neural networks, Naïve Bayes, and ensemble algorithms. In this study, six ML models (LR, random forest, AdaBoost, gradient boosting decision tree, XGBoost, and LGBM) were established to classify and predict the blood concentration of TAC. All models except LR are ensemble algorithms, which integrate several weak classifiers into one strong classifier. Ensemble algorithms have rapid speed and strong generalization ability, and they are suitable for application in many fields, including medical diagnosis [10].
During model building, grid search was used to choose the hyperparameters of each model. Grid search is an exhaustive method: it trains the learner with every combination of hyperparameter values in a user-defined range and then selects the best-performing combination within this range. Table 1 lists the core hyperparameters of the six models. In addition, the classification threshold was adjusted iteratively to achieve the best performance of the model.
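The exhaustive search and the threshold adjustment can be sketched as follows. This is a generic standard-library illustration under stated assumptions: `score_fn` stands in for whatever validation score was actually used, and the toy grid, scoring function, and data are hypothetical.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def best_threshold(y_true, y_prob, metric_fn, thresholds):
    """Return the threshold whose hard predictions maximise metric_fn
    (ties are resolved in favour of the larger threshold)."""
    scored = [(metric_fn(y_true, [int(p >= t) for p in y_prob]), t) for t in thresholds]
    return max(scored)[1]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy score function that peaks at max_depth=4, learning_rate=0.1 (hypothetical).
grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
params, score = grid_search(
    grid, lambda p: -(p["max_depth"] - 4) ** 2 - (p["learning_rate"] - 0.1) ** 2)

threshold = best_threshold([0, 0, 1, 1], [0.2, 0.6, 0.7, 0.9],
                           accuracy, [0.3, 0.5, 0.65, 0.8])
```

If the threshold is tuned on the same data used to report performance, the reported metrics will be optimistic, so the tuning data and the evaluation data should be kept separate.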

Model assessment
The evaluation criteria for binary classification generally include accuracy, precision, recall, F-1 score, and area under the curve (AUC), and they are derived from the confusion matrix (Table 2). Accuracy refers to the proportion of all samples that are predicted correctly and was calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
Recall refers to how many positive samples in the data set are identified and can be calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

In the ideal state, precision and recall are both as high as possible; however, the two tend to be inversely related, and a balance must be achieved. Therefore, the F-beta score was used to reflect the comprehensive situation of the model. The F-beta score was calculated using the following formulas:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}} \tag{5}$$

where precision is calculated using Eq. 4, Eq. 3 is the special case in which β equals 1, and the general F-beta score is calculated using Eq. 5.
When precision and recall are equally important, they are given the same weight, that is, β = 1 (F-1 score). However, in this study, type II errors were particularly important. Thus, we paid close attention to situations in which patients with abnormal blood concentrations were not identified (false negatives), which would have a negative effect on the treatment outcomes. Type II errors are generally measured by recall. Therefore, in this study, greater weight was given to recall, with β = 2 (F-2 score). The F-beta score ranges between 0 and 1, and the larger the value, the better the performance of the model. Finally, when the AUC was > 0.5, the model was considered meaningful. AUC can be calculated as follows:

$$\text{AUC} = \int_0^1 TPR \, d(FPR) \tag{6}$$

$$TPR = \frac{TP}{TP + FN} \tag{7}$$

$$FPR = \frac{FP}{FP + TN} \tag{8}$$

where true positive rate (TPR) and false positive rate (FPR) are calculated using Eqs. 7 and 8, respectively.
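The metrics defined above can be computed directly from the confusion-matrix counts, as in this minimal sketch. The function names and toy labels are assumptions; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one), which equals the area under the ROC curve.

```python
def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def auc(y_true, y_score):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example with hypothetical labels and scores.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2]

tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f2 = f_beta(precision, recall, beta=2)
area = auc(y_true, y_score)
```

Because the F-2 score weights recall four times as heavily as precision in the denominator, a model can score well on it while discriminating poorly, which is why the AUC is reported alongside it.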

Baseline information
Data from 913 patients and 1829 blood tests were included in this study. The baseline information of the study population is shown in

Model performance
The prediction performance of the six models is shown in Table 4. In terms of accuracy, only XGBoost and LGBM exceeded 70%, and the accuracy of XGBoost (73.33%) was higher than that of LGBM. The accuracy of the other models was low, and their performance was poor. We evaluated type II errors through the recall rate. A higher recall rate means that more patients with abnormal blood drug concentrations were correctly predicted, and clinicians can therefore adjust the dosage to reach effective and safe blood drug concentrations. However, when the probability of type II errors was low, the probability of type I errors increased. Judged by the F-beta score, which weighs both error types, XGBoost performed the best in balancing type I and II errors (F-beta score = 0.9124). In addition, the AUC value of XGBoost was the highest among all models. Therefore, considering the generalization ability and accuracy of the model, we believe that the XGBoost model is the most suitable of the six for predicting the blood concentration of TAC.
Table 5 shows the performance of the XGBoost model with different numbers of features. The features were selected from top to bottom according to the feature importance of the XGBoost model. Although the recall rate of the model was 1 when the number of features in the model was three or fewer, the AUC was only 0.5, and the model had no effective discriminative ability. Thus, too few features led to underfitting of the model. As the number of features increased, the evaluation indexes in Table 5 increased overall, with slight fluctuations. When the number of features was eight, all evaluation indexes were maximized (accuracy = 0.7333, F-beta = 0.9124, and AUC = 0.5531), and the performance of the model was the best. When the number of features increased beyond eight, the evaluation indexes decreased overall. Thus, too many features weakened the generalization ability of the model, causing overfitting.
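The combination of a recall of 1 with an AUC of 0.5 is what a model produces when it assigns every sample to the positive class. The sketch below illustrates this with a toy data set; the 7:3 positive-to-negative ratio mirrors the class ratio reported for the data, but applying it to the validation set is an assumption.

```python
# Ten hypothetical samples: 7 positive (outside the safe range), 3 negative.
y_true = [1] * 7 + [0] * 3
y_score = [1.0] * 10                       # the model gives everyone the same score
y_pred = [int(s >= 0.5) for s in y_score]  # so every sample is predicted positive

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f2 = 5 * precision * recall / (4 * precision + recall)  # F-beta with beta = 2

# AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5).
pos = [s for t, s in zip(y_true, y_score) if t == 1]
neg = [s for t, s in zip(y_true, y_score) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
```

Under these assumptions the all-positive model reaches an accuracy of 0.7 and an F-2 score of about 0.92 while its AUC stays at 0.5, so accuracy and F-2 values in this range do not by themselves demonstrate discriminative ability.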
Therefore, the performance of the model was optimized when using the top eight features for modeling. As shown in Fig. 4, the top eight features in the XGBoost model in descending order were serum creatinine level, weight, age, height, TAC dosage, pidotimod, Bailing, and Huangkui usage. Among them, serum creatinine level was nearly twice as important as any other feature, indicating that serum creatinine has a significant effect on the blood concentration of TAC. Weight, age, and height were also more important than many other characteristics, whereas sex and some combined medications had relatively little influence on the model.

Discussion
This study revealed that the XGBoost model, with an accuracy of 0.7333 and an F-beta score of 0.9124, performed best among the investigated models and could be used to monitor the blood concentration of TAC. Zheng et al. [11] also achieved the best results in regression prediction of TAC blood concentration from real-world data using the XGBoost model. Thus, the XGBoost model has certain advantages for clinical data prediction in real-world settings.
The feature importance ranking of XGBoost revealed that the serum creatinine level of the patients with kidney diseases, particularly nephrotic syndrome and membranous nephropathy, had a significant effect on their blood TAC concentration, which is consistent with the reported positive correlation between blood TAC concentration and serum creatinine level [12]. Weight and height also ranked high in this study as factors that affect the blood TAC concentration, which is consistent with the results from Zheng et al. [11,13]. Patient age is routinely evaluated by researchers [14,15]. In this study, it ranked third among all features. Finally, the importance of sex in the prediction model was relatively low, and sex was not included in the final model.
Previous studies have focused on the effect of TAC combined with other drugs [16,17] but did not evaluate the effect of the combination on blood TAC concentration. Our study suggests that the combination of Bailing and Huangkui with TAC affects blood TAC concentration. However, although pidotimod also had a high importance value in our study, there are no reports to support this result. We speculate that it may be related to the medication habits of physicians. These findings warrant future research.
In the last decade, a few studies have described the prediction of TAC concentration in the blood using ML technology. However, the models used in previous research were mostly artificial neural networks and regression models [18,19], the data sets were smaller, the models were not verified externally, and the research is still in the exploratory stage. In this study, using patients with nephrotic syndrome and membranous nephropathy as examples, blood TAC concentration was classified according to the safe blood concentration range and predicted using a variety of ML models. The number of real-world samples included in this study was considerably larger than that in previous research, and an external validation set was used to verify the model. Thus, the model results are more credible and of greater clinical relevance than those of previous models.
This study had several limitations. First, because the blood sample collection time was not available, we had to use the sample-receiving time instead. Ideally, the laboratory department will record the sample collection time in the future to further strengthen the integrity and analyzability of medical data. Second, more laboratory and genetic data should be analyzed.

Conclusion
In this study, an ML model was established to classify the blood TAC concentration in patients with nephrotic syndrome and membranous nephropathy. The oversampling method was used to manage unbalanced data, the variables were screened according to their importance value, and the performance of the six models was compared. Finally, XGBoost was selected as the best prediction model, considering its accuracy of 0.7333, F-beta score of 0.9124, and AUC of 0.5531, which were higher than those of the other models. In the XGBoost model, serum creatinine, weight, age, height, TAC dose, and the use of pidotimod, Bailing, and Huangkui were the main influencing factors of blood TAC concentration. The low AUC and high sensitivity of the model also imply that it is biased towards the positive class, which may have a negative impact on the prediction of the clinical dose of TAC for patients in the negative class. In this exploratory study, the ability of machine learning to predict TAC blood concentration was investigated. The study findings support the predictive potential of machine learning to a certain extent; however, further in-depth research is needed to resolve the model's bias towards the positive class.